{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# College Admissions Data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Using the geostates package" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`geostates` can be used to create choropleth plots of the United States or individual states. It is easy to use\n", "so we will start out with an example to show you some of the ins and outs of the package." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Admissions analysis" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Goal:** To illustrate the power of the package, we will start out by creating a plot that shows how the number of Princeton University acceptances varies by state in the United States." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will start by importing the `pandas` and `geostates` packages." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import pandas as pd" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "%matplotlib inline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Loading in the data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For this example, we use admissions data on the *Princeton University Class of 2025* from the [Princeton University Undergraduate Admissions Department](https://admission.princeton.edu/apply/admission-statistics). The CSV includes the total number of admits in the United States as of 30 August 2021 broken down by each geography (state)." ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "# read in the data\n", "admissions_data = pd.read_csv('Desktop/admissions_data_22.csv', index_col='state')" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>admits</th></tr>\n",
"    <tr><th>state</th><th></th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>Washington</th><td>17</td></tr>\n",
"    <tr><th>Oregon</th><td>6</td></tr>\n",
"    <tr><th>California</th><td>140</td></tr>\n",
"    <tr><th>Nevada</th><td>4</td></tr>\n",
"    <tr><th>Montana</th><td>1</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " admits\n", "state \n", "Washington 17\n", "Oregon 6\n", "California 140\n", "Nevada 4\n", "Montana 1" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# take a look at what the CSV file looks like\n", "admissions_data.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Analyzing the data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's take a look at which states have the most admits by sorting the list by descending values." ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>admits</th></tr>\n",
"    <tr><th>state</th><th></th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>New Jersey</th><td>188</td></tr>\n",
"    <tr><th>California</th><td>140</td></tr>\n",
"    <tr><th>New York</th><td>133</td></tr>\n",
"    <tr><th>Massachusetts</th><td>76</td></tr>\n",
"    <tr><th>Pennsylvania</th><td>68</td></tr>\n",
"    <tr><th>Texas</th><td>52</td></tr>\n",
"    <tr><th>Florida</th><td>50</td></tr>\n",
"    <tr><th>Connecticut</th><td>42</td></tr>\n",
"    <tr><th>Maryland</th><td>41</td></tr>\n",
"    <tr><th>Illinois</th><td>32</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>" ], "text/plain": [ " admits\n", "state \n", "New Jersey 188\n", "California 140\n", "New York 133\n", "Massachusetts 76\n", "Pennsylvania 68\n", "Texas 52\n", "Florida 50\n", "Connecticut 42\n", "Maryland 41\n", "Illinois 32" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# sort the values to see which states have the most admits\n", "sorted_admits_data = admissions_data.sort_values(by='admits', ascending=False)\n", "\n", "# view the first 10 values of the sorted pandas dataframe\n", "sorted_admits_data.head(10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The table above shows that New Jersey, California, and New York have the most admits in the Princeton undergraduate Class of 2025." ] }, { "cell_type": "code", "execution_count": 46, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Total admits from top three states: 461 students\n", "Total domestic admits: 1145 students\n", "40.26% of domestic admits come from NJ, CA, and NY\n" ] } ], "source": [ "# see what percent of the total number of domestic admits come from these top three states\n", "\n", "# calculate the total number of admits from New Jersey, California, and New York\n", "top_three_total_admits = sorted_admits_data.head(3)['admits'].sum()\n", "print('Total admits from top three states:', top_three_total_admits, 'students')\n", "\n", "# calculate the total number of domestic admits\n", "total_domestic_admits = sorted_admits_data['admits'].sum()\n", "print('Total domestic admits:', total_domestic_admits, 'students')\n", "\n", "# calculate the percent of the total admits that these three states contribute\n", "percent = (top_three_total_admits/total_domestic_admits)\n", "print('{:.2%}'.format(percent), 'of domestic admits come from NJ, CA, and NY')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is interesting! It turns out that just **three states** account for over 40% of the domestic undergraduate admits to Princeton University." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Visualize the data using geostates" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The first step in using the `geostates` package is to load the geodataframe containing the geometry of every state. For this, we will use the `load_states()` function and assign its result to a variable `df`. Once we've loaded the geodataframe, we need to merge it with our admissions data." ] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [], "source": [ "# import the load_states() function from the geostates package\n", "from geostates.shapefiles import load_states" ] }, { "cell_type": "code", "execution_count": 48, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>STATEFP</th><th>STATENS</th><th>AFFGEOID</th><th>GEOID</th><th>NAME</th><th>LSAD</th><th>ALAND</th><th>AWATER</th><th>geometry</th></tr>\n",
"    <tr><th>STUSPS</th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>MS</th><td>28</td><td>01779790</td><td>0400000US28</td><td>28</td><td>Mississippi</td><td>00</td><td>121533519481</td><td>3926919758</td><td>MULTIPOLYGON (((-88.50297 30.21523, -88.49176 ...</td></tr>\n",
"    <tr><th>NC</th><td>37</td><td>01027616</td><td>0400000US37</td><td>37</td><td>North Carolina</td><td>00</td><td>125923656064</td><td>13466071395</td><td>MULTIPOLYGON (((-75.72681 35.93584, -75.71827 ...</td></tr>\n",
"    <tr><th>OK</th><td>40</td><td>01102857</td><td>0400000US40</td><td>40</td><td>Oklahoma</td><td>00</td><td>177662925723</td><td>3374587997</td><td>POLYGON ((-103.00257 36.52659, -103.00219 36.6...</td></tr>\n",
"    <tr><th>VA</th><td>51</td><td>01779803</td><td>0400000US51</td><td>51</td><td>Virginia</td><td>00</td><td>102257717110</td><td>8528531774</td><td>MULTIPOLYGON (((-75.74241 37.80835, -75.74151 ...</td></tr>\n",
"    <tr><th>WV</th><td>54</td><td>01779805</td><td>0400000US54</td><td>54</td><td>West Virginia</td><td>00</td><td>62266474513</td><td>489028543</td><td>POLYGON ((-82.64320 38.16909, -82.64300 38.169...</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>" ], "text/plain": [ " STATEFP STATENS AFFGEOID GEOID NAME LSAD \\\n", "STUSPS \n", "MS 28 01779790 0400000US28 28 Mississippi 00 \n", "NC 37 01027616 0400000US37 37 North Carolina 00 \n", "OK 40 01102857 0400000US40 40 Oklahoma 00 \n", "VA 51 01779803 0400000US51 51 Virginia 00 \n", "WV 54 01779805 0400000US54 54 West Virginia 00 \n", "\n", " ALAND AWATER \\\n", "STUSPS \n", "MS 121533519481 3926919758 \n", "NC 125923656064 13466071395 \n", "OK 177662925723 3374587997 \n", "VA 102257717110 8528531774 \n", "WV 62266474513 489028543 \n", "\n", " geometry \n", "STUSPS \n", "MS MULTIPOLYGON (((-88.50297 30.21523, -88.49176 ... \n", "NC MULTIPOLYGON (((-75.72681 35.93584, -75.71827 ... \n", "OK POLYGON ((-103.00257 36.52659, -103.00219 36.6... \n", "VA MULTIPOLYGON (((-75.74241 37.80835, -75.74151 ... \n", "WV POLYGON ((-82.64320 38.16909, -82.64300 38.169... " ] }, "execution_count": 48, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# load in the geodataframe and assign it to df\n", "df = load_states()\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Merging the data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To successfully create a choropleth map of the college admissions data, we need to merge it with the geodataframe, which contains the geometry needed to plot each state. We can do this with the pandas `merge` function. Since the index of the college admissions data is `state` and our geodataframe contains a matching column (`NAME`), we can use these values as the key for merging the two dataframes. Let's start by renaming the `NAME` column in our geodataframe to `state` so that both names match." ] }, { "cell_type": "code", "execution_count": 49, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>STATEFP</th><th>STATENS</th><th>AFFGEOID</th><th>GEOID</th><th>state</th><th>LSAD</th><th>ALAND</th><th>AWATER</th><th>geometry</th></tr>\n",
"    <tr><th>STUSPS</th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>MS</th><td>28</td><td>01779790</td><td>0400000US28</td><td>28</td><td>Mississippi</td><td>00</td><td>121533519481</td><td>3926919758</td><td>MULTIPOLYGON (((-88.50297 30.21523, -88.49176 ...</td></tr>\n",
"    <tr><th>NC</th><td>37</td><td>01027616</td><td>0400000US37</td><td>37</td><td>North Carolina</td><td>00</td><td>125923656064</td><td>13466071395</td><td>MULTIPOLYGON (((-75.72681 35.93584, -75.71827 ...</td></tr>\n",
"    <tr><th>OK</th><td>40</td><td>01102857</td><td>0400000US40</td><td>40</td><td>Oklahoma</td><td>00</td><td>177662925723</td><td>3374587997</td><td>POLYGON ((-103.00257 36.52659, -103.00219 36.6...</td></tr>\n",
"    <tr><th>VA</th><td>51</td><td>01779803</td><td>0400000US51</td><td>51</td><td>Virginia</td><td>00</td><td>102257717110</td><td>8528531774</td><td>MULTIPOLYGON (((-75.74241 37.80835, -75.74151 ...</td></tr>\n",
"    <tr><th>WV</th><td>54</td><td>01779805</td><td>0400000US54</td><td>54</td><td>West Virginia</td><td>00</td><td>62266474513</td><td>489028543</td><td>POLYGON ((-82.64320 38.16909, -82.64300 38.169...</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>" ], "text/plain": [ " STATEFP STATENS AFFGEOID GEOID state LSAD \\\n", "STUSPS \n", "MS 28 01779790 0400000US28 28 Mississippi 00 \n", "NC 37 01027616 0400000US37 37 North Carolina 00 \n", "OK 40 01102857 0400000US40 40 Oklahoma 00 \n", "VA 51 01779803 0400000US51 51 Virginia 00 \n", "WV 54 01779805 0400000US54 54 West Virginia 00 \n", "\n", " ALAND AWATER \\\n", "STUSPS \n", "MS 121533519481 3926919758 \n", "NC 125923656064 13466071395 \n", "OK 177662925723 3374587997 \n", "VA 102257717110 8528531774 \n", "WV 62266474513 489028543 \n", "\n", " geometry \n", "STUSPS \n", "MS MULTIPOLYGON (((-88.50297 30.21523, -88.49176 ... \n", "NC MULTIPOLYGON (((-75.72681 35.93584, -75.71827 ... \n", "OK POLYGON ((-103.00257 36.52659, -103.00219 36.6... \n", "VA MULTIPOLYGON (((-75.74241 37.80835, -75.74151 ... \n", "WV POLYGON ((-82.64320 38.16909, -82.64300 38.169... " ] }, "execution_count": 49, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# rename the 'NAME' column in the geodataframe to 'state'\n", "geo_df = df.rename(columns={'NAME': 'state'})\n", "geo_df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Important:** To avoid accidentally losing rows during the merge, we include the `how='outer'` parameter in the merge statement, which keeps every row from both dataframes even when a state appears in only one of them." ] }, { "cell_type": "code", "execution_count": 55, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>state</th><th>admits</th><th>STATEFP</th><th>STATENS</th><th>AFFGEOID</th><th>GEOID</th><th>LSAD</th><th>ALAND</th><th>AWATER</th><th>geometry</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>Washington</td><td>17</td><td>53</td><td>01779804</td><td>0400000US53</td><td>53</td><td>00</td><td>172112588220</td><td>12559278850</td><td>MULTIPOLYGON (((-122.57039 48.53785, -122.5686...</td></tr>\n",
"    <tr><th>1</th><td>Oregon</td><td>6</td><td>41</td><td>01155107</td><td>0400000US41</td><td>41</td><td>00</td><td>248606993270</td><td>6192386935</td><td>MULTIPOLYGON (((-123.59892 46.25145, -123.5984...</td></tr>\n",
"    <tr><th>2</th><td>California</td><td>140</td><td>06</td><td>01779778</td><td>0400000US06</td><td>06</td><td>00</td><td>403503931312</td><td>20463871877</td><td>MULTIPOLYGON (((-118.60442 33.47855, -118.5987...</td></tr>\n",
"    <tr><th>3</th><td>Nevada</td><td>4</td><td>32</td><td>01779793</td><td>0400000US32</td><td>32</td><td>00</td><td>284329506470</td><td>2047206072</td><td>POLYGON ((-120.00574 39.22866, -120.00559 39.2...</td></tr>\n",
"    <tr><th>4</th><td>Montana</td><td>1</td><td>30</td><td>00767982</td><td>0400000US30</td><td>30</td><td>00</td><td>376962738765</td><td>3869208832</td><td>POLYGON ((-116.04914 48.50205, -116.04913 48.5...</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " state admits STATEFP STATENS AFFGEOID GEOID LSAD ALAND \\\n", "0 Washington 17 53 01779804 0400000US53 53 00 172112588220 \n", "1 Oregon 6 41 01155107 0400000US41 41 00 248606993270 \n", "2 California 140 06 01779778 0400000US06 06 00 403503931312 \n", "3 Nevada 4 32 01779793 0400000US32 32 00 284329506470 \n", "4 Montana 1 30 00767982 0400000US30 30 00 376962738765 \n", "\n", " AWATER geometry \n", "0 12559278850 MULTIPOLYGON (((-122.57039 48.53785, -122.5686... \n", "1 6192386935 MULTIPOLYGON (((-123.59892 46.25145, -123.5984... \n", "2 20463871877 MULTIPOLYGON (((-118.60442 33.47855, -118.5987... \n", "3 2047206072 POLYGON ((-120.00574 39.22866, -120.00559 39.2... \n", "4 3869208832 POLYGON ((-116.04914 48.50205, -116.04913 48.5... " ] }, "execution_count": 55, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data = pd.merge(admissions_data, geo_df, on='state', how='outer')\n", "data.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Plotting the data" ] }, { "cell_type": "code", "execution_count": 56, "metadata": {}, "outputs": [], "source": [ "# import the plot_states() function from geostates\n", "from geostates.plot import plot_states" ] }, { "cell_type": "code", "execution_count": 57, "metadata": {}, "outputs": [], "source": [ "# create a choropleth map that displays the admits for each state in the United States\n", "# plot = plot_states(data_2, column='admits', cmap=new_cmap, labels='both', linestyle='none', legend='legend',\n", " #bins=15)\n", "\n", "# add a title to the plot\n", "# plot.annotate('Princeton Admissions Data 2022', xy=(-97, 50.5), fontsize=18, ha='center');" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", 
"execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.4" } }, "nbformat": 4, "nbformat_minor": 2 }